Move resource optimisation changes to Production #4426
Conversation
📝 Walkthrough

This PR makes extensive modifications to Kubernetes configuration files across multiple services. The changes include adjustments to resource limits and requests, with several services now incorporating node affinity and nodeSelector settings to influence pod scheduling. Additionally, obsolete job configurations and multiple Helm chart files (deployment, HPA, service, tests, and associated helper templates) for the data-proxy application have been removed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Pod as Configured Pod
    participant API as Kubernetes API Server
    participant Scheduler as Kubernetes Scheduler
    participant Node as Target Node
    Pod->>API: Submit Pod spec (includes nodeSelector & affinity)
    API->>Scheduler: Forward pending Pod
    Scheduler->>Node: Evaluate Node (match nodeSelector and affinity)
    Note right of Scheduler: Check resource limits and scheduling rules
    Scheduler->>Pod: Assign Pod to matching Node
    Node-->>API: Report Pod running
```
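In values-file terms, the node-targeting pattern the diagram describes looks roughly like the sketch below (the role value is one used elsewhere in this PR; the exact keys consumed depend on each chart's templates):

```yaml
# Sketch of the scheduling block added to several services' values files:
# a hard nodeSelector plus a soft (preferred) node affinity on the same label.
nodeSelector:
  role: high-mem
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: role
              operator: In
              values:
                - high-mem
```

The nodeSelector is the hard constraint; the preferred affinity only biases scoring among nodes that already satisfy it.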
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

```
@@            Coverage Diff            @@
##           staging    #4426   +/-   ##
=========================================
  Coverage    11.24%    11.24%
=========================================
  Files          156       156
  Lines        18006     18006
  Branches       388       388
=========================================
  Hits          2025      2025
  Misses       15979     15979
  Partials         2         2
```
Actionable comments posted: 2
🧹 Nitpick comments (3)
k8s/locate/values-stage.yaml (1)

20-30: Consider removing redundant node selection configuration.

The nodeSelector and nodeAffinity configurations are both targeting nodes with role: moderate-usage. Since nodeSelector is a hard requirement, the preferredDuringSchedulingIgnoredDuringExecution affinity becomes redundant.

Consider keeping only the nodeSelector for a simpler configuration:

```diff
 nodeSelector:
   role: moderate-usage
-affinity:
-  nodeAffinity:
-    preferredDuringSchedulingIgnoredDuringExecution:
-      nodeSelectorTerms:
-        - matchExpressions:
-            - key: role
-              operator: In
-              values:
-                - moderate-usage
```

k8s/view/values-prod.yaml (1)
22-33: Reconsider high-mem node selection.

The service is configured to run on high-mem nodes despite having relatively modest memory requirements (420Mi limit). Consider using moderate-usage nodes instead.

```diff
 nodeSelector:
-  role: high-mem
+  role: moderate-usage
 affinity:
   nodeAffinity:
     preferredDuringSchedulingIgnoredDuringExecution:
       - preference:
           matchExpressions:
             - key: role
               operator: In
               values:
-                - high-mem
+                - moderate-usage
```

k8s/spatial/values-prod.yaml (1)
20-31: Consider simplifying node targeting strategy.

The configuration uses both nodeSelector and nodeAffinity for high-cpu nodes, which is redundant. Since nodeSelector is a hard requirement, the nodeAffinity preference doesn't add value here.

Consider using either:

- Just nodeSelector for a hard requirement:

```diff
 nodeSelector:
   role: high-cpu
-affinity:
-  nodeAffinity:
-    preferredDuringSchedulingIgnoredDuringExecution:
-      - preference:
-          matchExpressions:
-            - key: role
-              operator: In
-              values:
-                - high-cpu
-        weight: 1
```

- Or just nodeAffinity for flexible scheduling:

```diff
-nodeSelector:
-  role: high-cpu
 affinity:
   nodeAffinity:
     preferredDuringSchedulingIgnoredDuringExecution:
       - preference:
           matchExpressions:
             - key: role
               operator: In
               values:
                 - high-cpu
         weight: 1
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (26)
- k8s/analytics/values-prod.yaml (1 hunks)
- k8s/analytics/values-stage.yaml (0 hunks)
- k8s/auth-service/values-prod.yaml (2 hunks)
- k8s/calibrate/values-prod.yaml (1 hunks)
- k8s/data-mgt/values-prod.yaml (1 hunks)
- k8s/data-proxy/.helmignore (0 hunks)
- k8s/data-proxy/Chart.yaml (0 hunks)
- k8s/data-proxy/templates/NOTES.txt (0 hunks)
- k8s/data-proxy/templates/_helpers.tpl (0 hunks)
- k8s/data-proxy/templates/deployment.yaml (0 hunks)
- k8s/data-proxy/templates/hpa.yaml (0 hunks)
- k8s/data-proxy/templates/service.yaml (0 hunks)
- k8s/data-proxy/templates/tests/test-connection.yaml (0 hunks)
- k8s/data-proxy/values-prod.yaml (0 hunks)
- k8s/data-proxy/values-stage.yaml (0 hunks)
- k8s/device-monitor/values-prod.yaml (1 hunks)
- k8s/device-registry/values-prod.yaml (1 hunks)
- k8s/incentives/values-prod.yaml (1 hunks)
- k8s/locate/values-prod.yaml (1 hunks)
- k8s/locate/values-stage.yaml (1 hunks)
- k8s/meta-data/values-prod.yaml (1 hunks)
- k8s/predict/values-prod.yaml (1 hunks)
- k8s/spatial/values-prod.yaml (1 hunks)
- k8s/view/values-prod.yaml (2 hunks)
- k8s/website/values-prod.yaml (1 hunks)
- k8s/workflows/values-prod.yaml (1 hunks)
💤 Files with no reviewable changes (11)
- k8s/data-proxy/templates/service.yaml
- k8s/data-proxy/templates/hpa.yaml
- k8s/data-proxy/.helmignore
- k8s/data-proxy/templates/tests/test-connection.yaml
- k8s/analytics/values-stage.yaml
- k8s/data-proxy/values-prod.yaml
- k8s/data-proxy/values-stage.yaml
- k8s/data-proxy/templates/deployment.yaml
- k8s/data-proxy/Chart.yaml
- k8s/data-proxy/templates/_helpers.tpl
- k8s/data-proxy/templates/NOTES.txt
🧰 Additional context used
🪛 YAMLlint (1.35.1)
k8s/analytics/values-prod.yaml
[error] 25-25: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript)
🔇 Additional comments (17)
k8s/locate/values-prod.yaml (2)
20-30: Same redundant node selection configuration as staging environment.

Similar to staging, the nodeSelector makes the nodeAffinity preference redundant.

14-19: Verify the significant memory reduction impact.

The memory limits and requests have been substantially reduced:
- Memory limit: 500Mi → 120Mi
- Memory request: 300Mi → 80Mi
Please ensure that these reductions are based on actual usage metrics and won't cause OOM issues in production.
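As a lightweight cross-check (a sketch only; the namespace and label selector are taken from the verification script below), recent OOM kills would show up in the containers' last termination state:

```bash
# Flag locate-api containers whose last termination reason was OOMKilled;
# any matches after the rollout suggest the new 120Mi limit is too tight.
kubectl -n production get pods -l app=locate-api \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
  | grep OOMKilled
```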
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check memory usage trends for locate-api pods
kubectl -n production top pod -l app=locate-api --containers
```

Length of output: 160
Memory Reduction Impact Verification Request
It appears the command did not execute as expected due to a missing kubectl installation in our current environment. Without actual memory usage metrics from production, we cannot confirm whether the significant memory reductions (from 500Mi to 120Mi for limits and 300Mi to 80Mi for requests) might lead to potential OOM issues. Please manually verify the production memory usage metrics using an environment where kubectl is available or check via your monitoring tools.

k8s/data-mgt/values-prod.yaml (1)
18-19: Review extremely low CPU request.

The CPU request of 1m is unusually low and might affect the pod's Quality of Service (QoS) class. Consider setting a more realistic value based on actual usage patterns.
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check current CPU usage for data-mgt pods
kubectl -n production top pod -l app=data-mgt-api --containers
```

Length of output: 162
Manual Verification Required for CPU Metrics

It appears that the attempt to execute the kubectl command in our sandbox environment failed because the kubectl tool isn't available. To accurately assess the CPU usage for the data-mgt pods and validate whether a 1m CPU request is indeed too low to maintain an appropriate QoS, please verify the current CPU consumption using a proper Kubernetes environment. Could you manually run the following command in your production environment and share the results?

```bash
kubectl -n production top pod -l app=data-mgt-api --containers
```

This will help confirm whether the extremely low CPU request might be impacting QoS.
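For additional context, the effect on QoS can be read straight from pod status (a sketch reusing the namespace and label from the command above); with a 1m request and a higher limit the pods land in the Burstable class, since Guaranteed requires requests to equal limits for every container:

```bash
# Show the QoS class the scheduler assigned to each data-mgt pod.
kubectl -n production get pods -l app=data-mgt-api \
  -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
```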
k8s/incentives/values-prod.yaml (1)
14-19: Memory allocation increases align with high-mem node targeting.

The increased memory allocations make sense given the high-mem node targeting:
- Memory limit: 500Mi → 1000Mi
- Memory request: 250Mi → 500Mi
This aligns well with the service's resource needs.
k8s/view/values-prod.yaml (2)
44-44: Verify increased CPU utilization target impact.

The increase in targetCPUUtilizationPercentage from 70 to 80 will delay horizontal scaling. Ensure this aligns with the service's performance requirements and scaling needs.
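For reference, the rendered HPA would express this target roughly as below (autoscaling/v2 shape; the name, namespace, and replica bounds are illustrative rather than taken from the chart). The controller scales toward desiredReplicas = ceil(currentReplicas × currentUtilization / target), so raising the target from 70 to 80 lets pods run hotter before a scale-out is triggered:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: view-api           # illustrative
  namespace: production    # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: view-api         # illustrative
  minReplicas: 1           # illustrative
  maxReplicas: 3           # illustrative
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # the value under review
```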
16-21: Validate significant resource reductions.

The resource allocations have been substantially reduced:
- CPU limit: 100m → 20m
- Memory limit: 1500Mi → 420Mi
- Memory request: 500Mi → 300Mi
Please ensure these reductions are based on actual usage patterns.
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check resource usage trends for view-api pods
kubectl -n production top pod -l app=view-api --containers
```

Length of output: 158
Manual Verification Required for Resource Usage Trends

The automated check using kubectl didn't run as expected due to the missing kubectl command in this environment. Please run the command manually in an environment where kubectl is available, or verify the resource usage trends for the view-api pods in production using an alternative method. This will help validate whether the significant reductions in resource limits and requests align with actual usage.

k8s/device-monitor/values-prod.yaml (2)
18-22: Resource allocation changes look appropriate.

The significant reduction in CPU limit (50m) with increased memory allocation (400Mi request) aligns well with memory-intensive workloads that don't require much CPU processing.

23-24: Verify high-memory node availability in production.

The nodeSelector targeting high-mem nodes is appropriate for the memory-intensive workload, but we should ensure sufficient nodes are available.
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check available nodes with high-mem role
kubectl get nodes -l role=high-mem --context production
```

Length of output: 155
Action Required: Manually Verify High-Memory Node Availability

The command didn't produce output because our sandbox environment lacks the kubectl binary. Please run the following command in a properly configured production environment to ensure that nodes labeled role=high-mem are available:

```bash
kubectl get nodes -l role=high-mem --context production
```

This manual check is necessary to confirm that the nodeSelector in k8s/device-monitor/values-prod.yaml will function correctly in production.

k8s/auth-service/values-prod.yaml (2)
15-19: Resource limits appropriately increased for auth workload.

The increased CPU (400m) and memory (600Mi) limits with corresponding request adjustments reflect growing authentication service demands.

48-48: Consider implications of higher memory utilization threshold.

Increasing targetMemoryUtilizationPercentage to 85% may delay scaling events. Monitor memory usage patterns to ensure this doesn't impact service performance.
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check historical memory usage patterns
kubectl top pods -n production -l app=airqo-auth-api --context production
```

Length of output: 173
Action Required: Manually verify memory usage metrics

The automated shell script did not yield any output due to the absence of the kubectl command in our environment. Since we couldn't retrieve historical memory usage patterns, please run the provided kubectl top pods command in an environment where kubectl is installed and properly configured, or use your monitoring tools to confirm that setting the target memory utilization to 85% does not adversely affect scaling events.

k8s/device-registry/values-prod.yaml (1)
15-19: Verify significant resource reduction impact.

The substantial decreases in resource limits (CPU: 600m → 200m, Memory: 2000Mi → 500Mi) require careful validation.
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check historical resource usage
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/production/pods?labelSelector=app=airqo-device-registry-api" | jq .
```

Length of output: 226
Critical Resource Allocation Re-evaluation Required

The current command couldn't fetch historical usage data because the kubectl tool isn't available in the environment. Given the significant reduction in CPU (600m → 200m) and Memory (2000Mi → 500Mi), please manually verify that these limits still align with the service's actual resource consumption in production.

- Confirm the node metrics are available or re-run the command in an environment with kubectl installed.
- Validate the new resource values against historical usage data to ensure stable performance.
k8s/meta-data/values-prod.yaml (2)
22-33: LGTM: Node affinity configuration

The node affinity configuration preferring high-mem nodes aligns well with the memory-focused resource profile of this service.

17-21: Review CPU allocation settings

The CPU request of 10m seems very low for a production service. While this allows for better resource sharing, it might impact service performance during high load.
Consider validating the service performance with these CPU settings:
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check HPA metrics and events for any CPU-related scaling issues
kubectl get hpa -n production airqo-meta-data-api -o yaml
kubectl get events -n production --field-selector involvedObject.kind=HorizontalPodAutoscaler,involvedObject.name=airqo-meta-data-api
```

Length of output: 391
Manual verification required for production CPU metrics

The provided shell script did not yield any output because the kubectl command isn't available in this sandbox environment. Please verify the HPA and event metrics directly in your production environment (or another suitable environment with access to kubectl). This will help confirm if the CPU request of 10m is impacting the service's scaling or performance.

k8s/calibrate/values-prod.yaml (1)
19-23: Verify the significant memory reduction

The memory changes appear risky:
- Memory limit reduced by 87% (1500Mi → 200Mi)
- Memory request reduced by 98% (950Mi → 20Mi)
These drastic reductions might impact the service's stability.
Let's check historical memory usage:
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check memory usage patterns
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/production/pods | jq '.items[] | select(.metadata.name | startswith("airqo-calibrate-api")) | .containers[].usage.memory'
```

Length of output: 279
Action Required: Manual Verification of Memory Usage

The attempted script did not produce any output due to the absence of the kubectl command in our sandbox environment. Unfortunately, without runtime metrics from the cluster, we need to manually verify the impact of reducing the memory limits (from 1500Mi to 200Mi) and requests (from 950Mi to 20Mi) on the calibrate service's performance and stability.

- Verify that the current memory usage of the calibrate service in production supports these lowered settings.
- Check if there are alternative methods (or manual cluster inspection) available to measure memory consumption reliably.

Please perform a manual check or use an environment where kubectl is accessible to ensure that the service's operational memory requirements are met.

k8s/predict/values-prod.yaml (1)
19-35: LGTM: Well-balanced resource optimization

The changes demonstrate a good balance:
- Moderate CPU reduction
- Slight memory increase
- Appropriate node affinity for moderate resource usage
k8s/analytics/values-prod.yaml (1)
20-21: Verify the aggressive CPU limit reduction

The CPU limit has been reduced by 95% (1000m → 50m), which seems very aggressive for an analytics service that likely performs data processing.
Let's check CPU utilization patterns:
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check CPU usage patterns
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/production/pods | jq '.items[] | select(.metadata.name | startswith("airqo-analytics-api")) | .containers[].usage.cpu'
```

Length of output: 276
Action: Please manually verify the CPU usage patterns in production.

The command did not produce output because kubectl is not available in this environment, so we couldn't automatically fetch the metrics. Given that the YAML clearly sets the CPU limit at 50m (a 95% reduction from the previous 1000m), it's important to manually verify that production pods (specifically those with names starting with "airqo-analytics-api") are not CPU-starved during data processing.

- Location: k8s/analytics/values-prod.yaml (lines 20-21)
- Concern: Aggressive CPU limit reduction may affect performance.

Please run the metrics query in an environment where kubectl is accessible or verify the actual CPU usage on production to ensure the new limits meet the operational needs.

k8s/workflows/values-prod.yaml (1)
41-45: Review Redis memory allocation

The Redis memory settings seem too restrictive for production:
- Memory limit: 50Mi
- Memory request: 10Mi
These values might be insufficient for Redis to handle workflow data effectively.
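One way to sanity-check the 50Mi ceiling (a sketch reusing the namespace and labels from the verification script below) is to compare Redis's own reported footprint against it:

```bash
# Compare Redis's reported memory use with the 50Mi container limit;
# used_memory_human approaching ~50M would indicate the limit is too tight.
kubectl exec -n pipeline \
  "$(kubectl get pods -n pipeline -l app=workflows,component=redis -o jsonpath='{.items[0].metadata.name}')" \
  -- redis-cli info memory | grep -E 'used_memory_human|maxmemory_human'
```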
Let's check Redis memory usage:
❌ Verification inconclusive
🏁 Script executed:
```bash
#!/bin/bash
# Check Redis memory usage
kubectl exec -n pipeline -it $(kubectl get pods -n pipeline -l app=workflows,component=redis -o jsonpath='{.items[0].metadata.name}') -- redis-cli info memory
```

Length of output: 356
Manual Verification Required for Redis Memory Usage

The command did not produce output because kubectl wasn't found, so we could not validate the Redis memory usage automatically. Please run this check manually in an environment where kubectl is available. In the meantime, consider verifying that the current production resource limits for Redis are adequate.
```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
            - key: role
              operator: In
              values:
                - control-plane
        weight: 1
nodeSelector:
  role: control-plane
```
Reconsider scheduling pods on control-plane nodes.
Targeting control-plane nodes for application workloads is generally not recommended as it could impact cluster management operations.
Consider using worker nodes instead:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- preference:
matchExpressions:
- key: role
operator: In
values:
- - control-plane
+ - worker
weight: 1
nodeSelector:
- role: control-plane
+ role: worker
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
            - key: role
              operator: In
              values:
                - worker
        weight: 1
nodeSelector:
  role: worker
```
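If targeting control-plane nodes is intentional, it's also worth confirming whether those nodes carry the standard control-plane NoSchedule taint, in which case the pods would additionally need a matching toleration (a sketch; the role label comes from this configuration, and the taint name assumes a default kubeadm-style setup):

```bash
# Inspect taints on the nodes this workload targets; a
# node-role.kubernetes.io/control-plane:NoSchedule taint would block
# scheduling here unless the pod spec adds a toleration for it.
kubectl describe nodes -l role=control-plane | grep -A 3 'Taints:'
```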
```yaml
nodeSelector:
  role: high-cpu
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: role
              operator: In
              values:
                - high-cpu
```
🛠️ Refactor suggestion

Consider using preferred node affinity instead of required

Using requiredDuringSchedulingIgnoredDuringExecution might prevent pod scheduling if no high-cpu nodes are available. Consider using preferredDuringSchedulingIgnoredDuringExecution for more flexibility.
```diff
 nodeSelector:
   role: high-cpu
 affinity:
   nodeAffinity:
-    requiredDuringSchedulingIgnoredDuringExecution:
-      nodeSelectorTerms:
-        - matchExpressions:
-            - key: role
-              operator: In
-              values:
-                - high-cpu
+    preferredDuringSchedulingIgnoredDuringExecution:
+      - weight: 1
+        preference:
+          matchExpressions:
+            - key: role
+              operator: In
+              values:
+                - high-cpu
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
nodeSelector:
  role: high-cpu
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: role
              operator: In
              values:
                - high-cpu
```
Looks okay
Description
Move resource optimisation changes to Production
Summary by CodeRabbit
Chores
Refactor